Outline

Framing the problem

  1. Supervised learning
    1. classification: categorical \(y_i\) is available for all \(x_i\)
    2. regression: continuous \(y_i\) is available for all \(x_i\)
  2. Unsupervised learning: \(y_i\) unavailable for all \(x_i\)

What type of problem is this? (1/3)

Food servers’ tips in restaurants may be influenced by many factors, including the nature of the restaurant, size of the party, and table locations in the restaurant. Restaurant managers need to know which factors matter when they assign tables to food servers. For the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair treatment of the servers, for whom tips (at least in restaurants in the United States) are a major component of pay.

In one restaurant, a food server recorded the following data on all customers they served during an interval of two and a half months in early 1990. The restaurant, located in a suburban shopping mall, was part of a national chain and served a varied menu. In observance of local law the restaurant offered seating in a non-smoking section to patrons who requested it. Each record includes a day and time, and taken together, they show the server’s work schedule.

What is \(y\)? What is \(x\)?

What type of problem is this? (2/3)

Every person monitored their email for a week and recorded information about each email message; for example, whether it was spam, and what day of the week and time of day the email arrived. We want to use this information to build a spam filter, a classifier that will catch spam with high probability but will never classify good email as spam.

What is \(y\)? What is \(x\)?

What type of problem is this? (3/3)

A health insurance company collected the following information about households:

  • Total number of doctor visits per year
  • Total household size
  • Total number of hospital visits per year
  • Average age of household members
  • Total number of gym memberships
  • Use of physiotherapy and chiropractic services
  • Total number of optometrist visits

The health insurance company wants to provide a small range of products, containing different bundles of services and for different levels of cover, to market to customers.

What is \(y\)? What is \(x\)?

Model form

All (data-centric) models have fitted values and residuals.

\[y = f(x_1, x_2, ..., x_p) + \varepsilon\]

where \(y\) is the observed response, \(x_1, ..., x_p\) are the observed values of \(p\) predictors and \(\varepsilon\) is the error. We conventionally use \(n\) to specify the sample size.

  • We might not be able to specify \(f\) exactly in some model forms
  • Ideally the residuals have good properties, such as being spread uniformly over the data space and symmetric around the fitted values; some models also impose a distribution on them.
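As a minimal sketch in base R (simulated data with an arbitrary linear \(f\), chosen for illustration), the decomposition of the response into fitted values and residuals can be checked directly:

```r
# Simulate a response with a known (linear) f plus error
set.seed(1)
n <- 100
x <- runif(n)
y <- 2 + 3 * x + rnorm(n, sd = 0.5)

fit <- lm(y ~ x)

# Every observed y is exactly its fitted value plus its residual
all.equal(y, unname(fitted(fit) + resid(fit)))
# [1] TRUE
```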

Accuracy vs interpretability

Predictive accuracy

The primary purpose is to be able to predict \(\widehat{Y}\) for new data. And we’d like to do that well! That is, accurately.

From XKCD

Interpretability

Almost equally important is that we want to understand the relationship between \({\mathbf X}\) and \(Y\). The simpler model that is (almost) as accurate is the one we choose, always.

Person: Why did you predict 42 for this value?

Computer: Awkward silence

From Interpretable Machine Learning

Parametric vs non-parametric

Parametric methods

  • Assume that the model takes a specific form
  • Fitting then is a matter of estimating the parameters of the model
  • Generally considered to be less flexible
  • If assumptions are wrong, performance likely to be poor

Non-parametric methods

  • No specific assumptions
  • Allow the data to specify the model form, without being too rough or wiggly
  • More flexible
  • Generally needs more observations, and not too many variables
  • Easier to over-fit
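As a sketch of the contrast (simulated data with a sinusoidal truth, an assumption made purely for illustration), a parametric linear fit and a non-parametric loess fit can be compared on the same data:

```r
# Parametric (lm) vs non-parametric (loess) fits on the same data
set.seed(1)
x <- runif(80)
y <- sin(2 * pi * x) + rnorm(80, sd = 0.3)

fit_p  <- lm(y ~ x)      # assumes a specific (linear) form
fit_np <- loess(y ~ x)   # lets the data determine the shape

# Training RSS: the flexible loess fit tracks the curve more closely
sum(resid(fit_p)^2)
sum(resid(fit_np)^2)
```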

From XKCD

Black line is true boundary.


Grids (right) show boundaries for two different models.

Reducible vs irreducible error

If the model form is incorrect, the error (solid circles) may arise from the wrong shape, and is thus reducible. Irreducible error means that we have the right model and the mistakes (solid circles) are random noise.

Flexible vs inflexible

Parametric models tend to be less flexible but non-parametric models can be flexible or less flexible depending on parameter settings.

Bias vs variance

Bias is the error that is introduced by modeling a complicated problem by a simpler problem.

  • For example, linear regression assumes a linear relationship and perhaps the relationship is not exactly linear.
  • In general, but not always, the more flexible a method is, the less bias it will have because it can fit a complex shape better.

Variance refers to how much your estimate would change if you had different training data. It measures how much your model depends on the particular data you have, to the neglect of future data.

  • In general, the more flexible a method is, the more variance it has.
  • The size of the training data can impact on the variance.
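One way to see the flexibility–variance connection is a small simulation (an assumed sinusoidal truth, with polynomial fits of different degrees): refit each model on many training samples and see how much its prediction at a fixed point varies.

```r
# Variance of the prediction at x = 0.5 across many training samples
set.seed(1)
pred_at_half <- function(degree) {
  replicate(200, {
    x <- runif(50)
    y <- sin(2 * pi * x) + rnorm(50, sd = 0.3)
    fit <- lm(y ~ poly(x, degree))
    predict(fit, newdata = data.frame(x = 0.5))
  })
}
var(pred_at_half(1))   # inflexible: low variance (but high bias)
var(pred_at_half(10))  # flexible: higher variance
```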

Bias

High bias: when you impose too many assumptions with a parametric model, or use an inadequate non-parametric model, such as not letting an algorithm converge fully.

Low bias: when the model closely captures the true shape, with a well-specified parametric model or a flexible model.

Variance

Low variance: this fit will be virtually identical even if we had a different training sample.

High variance: likely to get a very different model if a different training set is used.

Training vs test splits

When data are reused for multiple tasks, instead of carefully spent from the finite data budget, certain risks increase, such as the risk of accentuating bias or compounding effects from methodological errors. (Julia Silge)

  • Training set: Used to fit the model, might be also broken into a validation set for frequent assessment of fit.
  • Test set: Purely used to assess the final model's performance on future data.

Training vs test splits

d_bal <- tibble(y=c(rep("A", 6), rep("B", 6)),
                x=c(runif(12)))
d_bal$y
 [1] "A" "A" "A" "A" "A" "A" "B" "B" "B" "B" "B" "B"
set.seed(130)
d_bal_split <- initial_split(d_bal, prop = 0.70)
training(d_bal_split)$y
[1] "A" "A" "B" "A" "B" "A" "B" "A"
testing(d_bal_split)$y
[1] "A" "B" "B" "B"
d_unb <- tibble(y=c(rep("A", 2), rep("B", 10)),
                x=c(runif(12)))
d_unb$y
 [1] "A" "A" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
set.seed(132)
d_unb_split <- initial_split(d_unb, prop = 0.70)
training(d_unb_split)$y
[1] "B" "B" "A" "B" "B" "A" "B" "B"
testing(d_unb_split)$y
[1] "B" "B" "B" "B"

Always stratify splitting by sub-groups, especially response variable classes.

d_unb_strata <- initial_split(d_unb, prop = 0.70, strata=y)
training(d_unb_strata)$y
[1] "A" "B" "B" "B" "B" "B" "B" "B"
testing(d_unb_strata)$y
[1] "A" "B" "B" "B"

Measuring accuracy for categorical response

Compute \(\widehat{y}\) from the training data, \(\{(y_i, {\mathbf x}_i)\}_{i = 1}^n\). The fraction of misclassifications gives the Training Error Rate:

\[\text{Error rate} = \frac{1}{n}\sum_{i=1}^n I(y_i \ne \widehat{y}({\mathbf x}_i))\]

A better estimate of future accuracy is obtained using test data to get the Test Error Rate.


Training error will usually be smaller than test error. When it is much smaller, it indicates that the model is too well fitted to the training data to be accurate on future data (over-fitted).
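In base R the error rate is just the average of the misclassification indicators, here shown with toy labels standing in for a fitted classifier's predictions:

```r
# Toy observed labels and predictions (a fitted model would supply y_hat)
y     <- c("A", "A", "B", "B", "A", "B")
y_hat <- c("A", "B", "B", "B", "A", "A")

# I(y_i != y_hat_i), averaged over the n observations
mean(y != y_hat)
# [1] 0.3333333
```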

Confusion (misclassification) matrix

                 predicted
                   1    0
    true    1      a    b
            0      c    d

Consider 1=positive (P), 0=negative (N).

  • True positive (TP): a
  • True negative (TN): d
  • False positive (FP): c (Type I error)
  • False negative (FN): b (Type II error)
  • Sensitivity, recall, hit rate, or true positive rate (TPR): TP/P = a/(a+b)
  • Specificity, selectivity or true negative rate (TNR): TN/N = d/(c+d)
  • Prevalence: P/(P+N) = (a+b)/(a+b+c+d)
  • Accuracy: (TP+TN)/(P+N) = (a+d)/(a+b+c+d)
  • Balanced accuracy: (TPR + TNR)/2 (the average of the per-class accuracies)
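With hypothetical counts for a, b, c and d, the definitions above translate directly into base R:

```r
# Hypothetical confusion matrix counts
a <- 40; b <- 10   # true 1: TP, FN
c <- 5;  d <- 45   # true 0: FP, TN

TPR      <- a / (a + b)               # sensitivity: 0.8
TNR      <- d / (c + d)               # specificity: 0.9
accuracy <- (a + d) / (a + b + c + d) # 0.85
bal_acc  <- (TPR + TNR) / 2           # 0.85
```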

Confusion (misclassification) matrix: computing

Two classes

#| label: confusion-matrix-tidy
# Write out the confusion matrix in standard form
cm <- a2 |> count(y, pred) |>
  group_by(y) |>
  mutate(cl_acc = n[pred==y]/sum(n)) 
cm |>
  pivot_wider(names_from = pred, 
              values_from = n) |>
  select(y, bilby, quokka, cl_acc)
# A tibble: 2 × 4
# Groups:   y [2]
  y      bilby quokka cl_acc
  <fct>  <int>  <int>  <dbl>
1 bilby      9      3  0.75 
2 quokka     5     10  0.667
accuracy(a2, y, pred) |> pull(.estimate)
[1] 0.7
bal_accuracy(a2, y, pred) |> pull(.estimate)
[1] 0.71
sens(a2, y, pred) |> pull(.estimate)
[1] 0.75
specificity(a2, y, pred) |> pull(.estimate)
[1] 0.67

More than two classes

# Write out the confusion matrix in standard form
cm3 <- a3 |> count(y, pred) |>
  group_by(y) |>
  mutate(cl_acc = n[pred==y]/sum(n)) 
cm3 |>
  pivot_wider(names_from = pred, 
              values_from = n, values_fill=0) |>
  select(y, bilby, quokka, numbat, cl_acc)
# A tibble: 3 × 5
# Groups:   y [3]
  y      bilby quokka numbat cl_acc
  <fct>  <int>  <int>  <int>  <dbl>
1 bilby      9      3      0  0.75 
2 numbat     0      2      6  0.75 
3 quokka     5     10      0  0.667
accuracy(a3, y, pred) |> pull(.estimate)
[1] 0.71
bal_accuracy(a3, y, pred) |> pull(.estimate)
[1] 0.78

Receiver Operating Characteristic (ROC) curves

The trade-off between getting the positives right and not predicting everything as positive.

From wikipedia

Need predictive probabilities, probability of being each class.

a2 |> slice_head(n=3)
# A tibble: 3 × 4
  y     pred  bilby quokka
  <fct> <fct> <dbl>  <dbl>
1 bilby bilby   0.9    0.1
2 bilby bilby   0.8    0.2
3 bilby bilby   0.9    0.1
roc_curve(a2, y, bilby) |>
  autoplot()

Preparing to fit model: IDA



The first thing to do with data is to look at them …. usually means tabulating and plotting the data in many different ways to see what’s going on. With the wide availability of computer packages and graphics nowadays there is no excuse for ducking the labour of this preliminary phase, and it may save some red faces later.

Crowder, M. J. & Hand, D. J. (1990) “Analysis of Repeated Measures”

Pre-processing steps

  • Handle missing values: most modeling methods require complete data.
  • Decide on an appropriate model type: Explore relationships between response and predictors. Are they nonlinear? Clustered? Different shapes indicate which model will likely fit better.
  • Check model assumptions
  • Feature engineering: Do you have the predictors needed, or could you create better predictors? Should some variables be transformed?
  • Check for anomalies: Are there some observations that are unusual and might affect the fit? (We’ll do this at the end, too.)
  • Split data: Decide on how much of the data should be used for training a model, validation, and testing.
  • Reduce the number of predictors: Too many predictors can result in over-fitting the training data.

Check model assumptions

  • Statistical models tend to be less flexible. They impose strict assumptions, such as a linear form and Gaussian errors. Check by plotting the response against the predictors, to verify that the relationship is what is assumed and that the spread is uniform.
  • Non-parametric models, and algorithms, don’t explicitly impose assumptions, but the process imposes constraints, which means the fit will be poor if they are not satisfied. For example, tree-based methods mostly use a single variable for any split. If the relationship with the response is best explained by a combination of variables, this will result in a poor fit.

Feature engineering

Creating new variables to get better fits is a special skill! Sometimes automated by the method. All are transformations of the original variables. (See tidymodels steps.)

  • scaling, centering, sphering (step_pca)
  • log or square root or box-cox transformation (step_log)
  • ratio of values (step_ratio)
  • polynomials or splines: \(x_1^2, x_1^3\) (step_poly, step_ns)
  • dummy variables: categorical predictors expanded into multiple new binary variables (step_dummy)
  • Convolutional Neural Networks: neural networks but with pre-processing of images to combine values of neighbouring pixels; flattening of images
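Base-R equivalents of a few of these steps, on toy data (in practice the tidymodels step_* functions above handle this inside a recipe):

```r
d <- data.frame(x1 = c(1, 10, 100, 1000),
                x2 = c(2, 4, 6, 8),
                g  = factor(c("a", "b", "a", "b")))

d$log_x1 <- log(d$x1)                        # log transformation
d$x2_std <- as.numeric(scale(d$x2))          # centre and scale
d$ratio  <- d$x1 / d$x2                      # ratio of values
dummies  <- model.matrix(~ g - 1, data = d)  # dummy variables
```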

After fitting: diagnostics

Diagnosing the fit

Compute and examine the usual diagnostics; some methods have more than others:

  • fit statistics: accuracy, sensitivity, specificity
  • errors/misclassifications
  • variable importance
  • plot residuals, examine the misclassifications
  • check test set is similar to training

Go beyond … Look at the data and the model together!

Wickham et al (2015) Removing the Blindfold

Training - plusses; Test - dots

Model-in-the-data-space

From XKCD


We plot the model on the data to assess whether it fits or is a misfit!


Doing this in high dimensions is considered difficult!

So it is common to only plot the data-in-the-model-space.

NOTE, WE CAN PLOT THESE THINGS IN HIGH DIMENSIONS. But this is for another day.

Data-in-the-model-space

Predictive probabilities are an aspect of the model, and are useful to plot. What do we learn here?

But it doesn’t tell you why there is a difference.

Model-in-the-data-space

The model is displayed as a grid of predicted points in the original variable space. The data are overlaid, using text labels. What do you learn?

One model has a linear boundary, and the other has a highly non-linear boundary, which matches the class clusters better. Also …

Plotting residuals

Code
# Load the Jack-o'-Lantern data
d <- jackolantern_surreal_data[sample(1:5395, 2000),]

ggduo(d, c("x1", "x2", "x3", "x4"), "y")

Code
# Fit a linear model to the surreal Jack-o'-Lantern data
d_lm <- lm(y ~ ., data = d)

# Plot the residuals to reveal the hidden image
d_all <- augment(d_lm, d)
ggplot(d_all, aes(x=.fitted, y=.resid)) +
  geom_point() +
  theme(aspect.ratio=1)

Reading residual plots using a lineup

  • Reading residual plots can be extremely hard!
  • You have to decide whether there is NO PATTERN.
  • This is a good place to use the lineup developed to do inference for data visualisation.
Code
x <- lm(tip ~ total_bill, data = tips)
tips.reg <- data.frame(tips, .resid = residuals(x), .fitted = fitted(x))
library(ggplot2)
ggplot(lineup(null_lm(tip ~ total_bill, method = 'rotate'), tips.reg)) +
  geom_point(aes(x = total_bill, y = .resid)) +
  facet_wrap(~ .sample)

Re-sampling statistics

Tidy models

Models and model output

Functions such as lm, glm, lmer, glmmTMB … allow us to fit models

e.g. fit french fry rating with respect to all designed main effects:

ff_long <- french_fries |> 
  pivot_longer(potato:painty, names_to = "type", values_to = "rating")
ff_lm <- lm(rating ~ type + treatment + time + subject, 
            data = ff_long)

summary, str, resid, fitted, coef, … allow us to extract different parts of a model for a linear model. Other model functions work differently … major headaches!

summary(ff_lm)

Call:
lm(formula = rating ~ type + treatment + time + subject, data = ff_long)

Residuals:
   Min     1Q Median     3Q    Max 
-7.073 -1.967 -0.464  1.520 10.219 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.1985     0.2579    4.65  3.5e-06 ***
typegrassy   -1.1551     0.1557   -7.42  1.5e-13 ***
typepainty    0.7005     0.1558    4.50  7.1e-06 ***
typepotato    5.1332     0.1557   32.96  < 2e-16 ***
typerancid    2.0329     0.1557   13.06  < 2e-16 ***
treatment2   -0.0513     0.1205   -0.43    0.670    
treatment3   -0.0455     0.1206   -0.38    0.706    
time2         0.1019     0.2161    0.47    0.637    
time3        -0.0136     0.2161   -0.06    0.950    
time4        -0.1025     0.2161   -0.47    0.635    
time5        -0.1330     0.2169   -0.61    0.540    
time6        -0.0461     0.2161   -0.21    0.831    
time7        -0.2462     0.2163   -1.14    0.255    
time8        -0.1163     0.2166   -0.54    0.591    
time9         0.1319     0.2278    0.58    0.563    
time10        0.5463     0.2278    2.40    0.017 *  
subject10     1.7145     0.2439    7.03  2.5e-12 ***
subject15    -0.3591     0.2449   -1.47    0.143    
subject16     0.4752     0.2441    1.95    0.052 .  
subject19     2.0165     0.2439    8.27  < 2e-16 ***
subject31     1.4928     0.2510    5.95  3.0e-09 ***
subject51     1.8728     0.2439    7.68  2.1e-14 ***
subject52     0.1948     0.2439    0.80    0.424    
subject63     0.9605     0.2439    3.94  8.4e-05 ***
subject78    -0.5828     0.2439   -2.39    0.017 *  
subject79    -0.5370     0.2503   -2.15    0.032 *  
subject86     0.4354     0.2510    1.73    0.083 .  
---
Signif. codes:  
0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.9 on 3444 degrees of freedom
  (9 observations deleted due to missingness)
Multiple R-squared:   0.4,  Adjusted R-squared:  0.395 
F-statistic: 88.1 on 26 and 3444 DF,  p-value: <2e-16

Tidying model output

The package broom (broom.mixed for glmmTMB) gets model results into a tidy format at different levels

  • model level: broom::glance (values such as AIC, BIC, model fit, …)
  • coefficients in the model: broom::tidy (estimate, confidence interval, significance level, …)
  • observation level: broom::augment (fitted values, residuals, predictions, influence, …)

Goodness of fit statistics: glance

glance(ff_lm)
# A tibble: 1 × 12
  r.squared adj.r.squared sigma statistic p.value    df
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>
1     0.400         0.395  2.90      88.1       0    26
# ℹ 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#   deviance <dbl>, df.residual <int>, nobs <int>

Model estimates: tidy

ff_lm_tidy <- tidy(ff_lm)
head(ff_lm_tidy)
# A tibble: 6 × 5
  term        estimate std.error statistic   p.value
  <chr>          <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)   1.20       0.258     4.65  3.50e-  6
2 typegrassy   -1.16       0.156    -7.42  1.49e- 13
3 typepainty    0.701      0.156     4.50  7.11e-  6
4 typepotato    5.13       0.156    33.0   2.28e-207
5 typerancid    2.03       0.156    13.1   4.69e- 38
6 treatment2   -0.0513     0.121    -0.426 6.70e-  1

Model diagnostics: augment

ff_lm_all <- augment(ff_lm)
glimpse(ff_lm_all)
Rows: 3,471
Columns: 12
$ .rownames  <chr> "1", "2", "3", "4", "5", "6", "7", "8",…
$ rating     <dbl> 2.9, 0.0, 0.0, 0.0, 5.5, 14.0, 0.0, 0.0…
$ type       <chr> "potato", "buttery", "grassy", "rancid"…
$ treatment  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ time       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ subject    <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 10, 10, 1…
$ .fitted    <dbl> 6.332, 1.199, 0.043, 3.231, 1.899, 6.33…
$ .resid     <dbl> -3.432, -1.199, -0.043, -3.231, 3.601, …
$ .hat       <dbl> 0.0079, 0.0079, 0.0079, 0.0079, 0.0079,…
$ .sigma     <dbl> 2.9, 2.9, 2.9, 2.9, 2.9, 2.9, 2.9, 2.9,…
$ .cooksd    <dbl> 4.2e-04, 5.1e-05, 6.7e-08, 3.7e-04, 4.6…
$ .std.resid <dbl> -1.188, -0.415, -0.015, -1.119, 1.247, …

Residual plot

ggplot(ff_lm_all, aes(x=.fitted, y=.resid)) + geom_point()

Adding random effects

Add random intercepts for each subject

fries_lmer <- lmer(rating ~ type + treatment + time + (1 | subject), 
                   data = ff_long)

Your turn

  • Run the code of the examples so far

The broom.mixed package provides for mixed effects models the same functionality that broom provides for fixed effects models.

  • Try out the functionality of broom.mixed on each level: model, parameters, and data
  • Augment the ff_long data with the model diagnostics
  • Plot the residuals from the random effects model
  • Plot fitted values against observed values


Statistical models

Computational models

Forecasting

Resources